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Abstract 

Motivation: 

Term enrichment analysis facilitates biological interpretation by assigning to experimentally/computationally obtained 
data annotation associated with terms from controlled vocabularies. This process usually involves obtaining statistical 
significance for each vocabulary term and using the most significant terms to describe a given set of biological entities, 
often associated with weights. Many existing enrichment methods require selections of (arbitrary number of) the most 
significant entities and/or do not account for weights of entities. Others either mandate extensive simulations to obtain 
statistics or assume normal weight distribution. In addition, most methods have difficulty assigning correct statistical 
significance to terms with few entities. 

Results: 

Implementing the well-known Lugananni-Rice formula, we have developed a novel approach, called SaddleSum, that 
is free from all the aforementioned constraints and evaluated it against several existing methods. With entity weights 
properly taken into account, SaddleSum is internally consistent and stable with respect to the choice of number of most 
significant entities selected. Making few assumptions on the input data, the proposed method is universal and can thus be 
applied to areas beyond analysis of microarrays. Employing asymptotic approximation, SaddleSum provides a term-size 
dependent score distribution function that gives rise to accurate statistical significance even for terms with few entities. 
As a consequence, SaddleSum enables researchers to place confidence in its significance assignments to small terms that 
are often biologically most specific. 

Availability: 

Our implementation, which uses Bonferroni correction to account for multiple hypotheses testing, is available at 
http://www.ncbi.nlm.nih.gov/CBBresearch/qmbp/mn/enrich/ Source code for the standalone version can be down- 
loaded from ftp://ftp.ncbi.nlm.nih.gov/pub/qmbpmn/SaddleSum/ 

Contact: 

yyu@ncbi.nlm.nih.gov 
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1 Introduction 

A major challenge of contemporary biology is to ascribe 
interpretation to high-throughput experimental or compu- 
tational results, where each considered entity (gene or 
protein) is assigned a value. Biological information is 
often summarized through controlled vocabularies such 



as Gene Ontology (GO) ( Ashburner et al. 2000 1, where 
each annotated term includes a list of entities. Let w de- 
note a collection of values, each associated with an entity. 
Given w and a controlled vocabulary, enrichment analy- 
sis aims to retrieve the terms that by statistical inference 
best describe w, that is, the terms associated with enti- 
ties with atypical values. Many enrichment analysis tools 
have been developed primarily to process microarray data 



(Huang et al. 2009 1. In terms of biological relevance, the 
performance assessment of those tools is generally diffi- 
cult. It requires a large, comprehensive 'gold standard' 
vocabulary together with a collection of w's processed 
from experimental data, and with true/false positive terms 
corresponding to each w correctly specified. This invari- 
ably introduces some degree of circularity because the 
terms often come from curating experimental results. Be- 
fore declaring efficacy in biological information retrieval 
that is nontrivial to assess, an enrichment method should 
pass at least the statistical accuracy and internal consis- 
tency test. 



In their recent survey, Huang et al. ( 2009 1 list 68 dis- 
tinct bioinformatic enrichment tools introduced between 
2002 and 2008. Most tools share a similar workflow: 
given w obtained by suitably processing experimental 
data, they sequentially test each vocabulary term for en- 
richment to obtain its P-value (the likelihood of a false 
positive given the null hypothesis). Since many terms are 
tested, a multiple hypothesis correction, such as Bonfer- 
roni ( |Hochberg and Tamhane| |1987| l or false discovery 



rate (FDR) (Benjamini and Hochberg, 1995), is applied to 
P-value of each to obtain the final statistical significance. 
The results are displayed for the user in a suitable form 
outlining the significant terms and possibly relations be- 
tween them. Note that the latter steps are largely indepen- 



dent from the first. To avoid confounding factors, we will 
focus exclusively on the original enrichment P-values. 

Based on the statistical methods employed, the exist- 
ing enrichment tools can generally be divided into two 
main classes. The singular enrichment analysis (SEA) 
class contains numerous tools that form the majority of 
published ones ( |Huang et al.\ [2009). By ordering values 
in w, these tools require users to select a number of top- 
ranking entities as input and mostly use hypergeometric 
distribution (or equivalently Fisher's exact test) to obtain 
the term P-values. After the selection is made, SEA treats 
all entities equally, ignoring their value differences. 

The gene set analysis (GSA) class was pioneered by 
the GSEA tool ( |Mootha et oTl|2003||Subramanian et al.\ 
2005). Tools from this class use all values (entire w) 
to calculate P-values and do not require pre-selection of 
entities. Some approaches (Breitlin g et al.\ |2004| |A1-| 
|Shahrour efoT] [20071 [Blom et aZ.||20"07l|Eden et aZ.||20"09"| > 
in this group apply hypergeometric tests to all possible 
selections of top-ranking entities. The final P-value is 
computed by combining (in a tool-specific manner) the 
P-values from the individual tests. Other approaches use 
non-parametric approaches: rank-based statistics such as 



Wilcoxon rank-sum ( Breslin et al. 2004 ) or Kolmogorov- 
Smirnov-like (Moot ha et al.\ |2003[ |Sub ramania n et aL\ 
|2005l|Ben-Shaul et aZ.||2005l|Backes et alflMyty . When 
weights are taken into account, such as in GSEA ( |Subra-| 
manian et al. 2005 | l, statistical significance must be de- 



termined from a sampled (shuffled) distribution. Unfortu- 
nately, limited by the number of shuffles that can be per- 
formed, the smallest obtainable P-value is bounded away 
from 0. 

The final group of GSA methods computes a score for 
each vocabulary term as a sum of the values (henceforth 
used interchangeably with weights) of the m entities it 
annotates. In general, the score distribution pdf m (5) for 
the experimental data is unknown. By Central Limit The- 



orem, when m is large, Gaussian ( |Smid and Dorssers 



[2004} |Kim and Volsky] [2005] > or Student's t-distribution 
( |Boorsma efaT] [2005] |L~uo et al.\ |2009) can be used to 
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approximate pdf TO (5). Unfortunately, when the weight 
distributions are skewed, the required m may be too large 
for practical use. Evidently, this undermines the P-value 
accuracy of small terms (meaning terms with few enti- 
ties), which are biologically most specific. 

It is generally found that, given the same vocabulary 
and w, different enrichment analysis tools report diverse 
results. We believe this may be attributed to disagree- 
ment in P-values reported as well as that different meth- 
ods have different degree of robustness (internal consis- 
tency). Instead of providing a coherent biological under- 
standing, the array of diverse results questions the con- 
fidence of information found. Furthermore, other than 
microarray datasets, there exist experimental or compu- 



tational results such as those from ChlP-chip (Eden et al. 



2007 1, deep sequencing ( Sultan et al. 2008), quantitative 



proteomics (Sharma et al. 2009 1 and in silico network 



simulations (Stojmirovic and Yu 2007 2009), that may 
benefit from enrichment analysis. It is thus imperative to 
have an enrichment method that report accurate P-values, 
preserves internal consistency, and allows investigations 
of a broader range of datasets. 

To achieve these goals, we have developed a novel 
enrichment tool, called SaddleSum, that founds on the 



well-known Lugananni-Rice formula (Lugannani and 



Rice 1980) and derives its statistics from approximating 
asymptotically the distribution function of the scores used 
in the parametric GSA class. This allows us to obtain ac- 
curate statistics even in the cases where the distribution 
function generating w is very skewed and for terms con- 
taining few entities. The latter aspect is particularly im- 
portant for obtaining biologically specific information. 



2 Methods 

2.1 Mathematical foundations for SaddleSum 

We distinguish two sets: the set of entities M of size n and the controlled 
vocabulary V. Each term from V maps to a set Ai C Af of size m < n. 
From experimental results, we obtain a set w = {wj \ j £ AT} and 
ask how likely it is to randomly pick m entities whose sum of weights 
exceeds the sum S = SjeAl w j- 



Assume that the weights in w come independently from a continuous 
probability space W with the density function p such that the moment 
generating function p(t) = f w p(x)e tx dx exists for t in a neighbor- 
hood of 0. The density of S, sum of m weights arbitrarily sampled from 
w, can be expressed by the Fourier inversion formula 



P<if m (S) 



i r 



mK(it) —itS 



dt. 



(1) 



where Kit) = In p(t) denotes the cumulant generating function of p. 
The tail probability or P-value for a score S is given by 



Prob(S' > S) 



r 

■Is 



pdf m (S) dS. 



(2) 



We propose to use an asymptotic approximation to J5J, which improves 
with increasing m and S. 

|Daniels| j!954) derived an asymptotic approximation for the den- 
sity pdf m through saddlepoint expansion of the integral Q while the 
corresponding approximation to the tail probability was obtained by 
Lugannani and Rice 1980 . Let <f>(x) = exp(— x 2 /2)/\/27r and 



Let tj>(x) = exp(- 
= f °° <j>(t)dt denote respectively the density and the tail proba- 
bility of Gaussian distribution. Let A be a solution of the equation 

S = mK'(X). (3) 

Then, the leading term of the Lugananni-Rice approximation to the tail 
probability takes the form 

1 r 

y 



Prob(,S > S) = <S>(z) + 



<j>(z) + 0(m-' i/2 ), (4) 



where y = X^/mK"{X) and z = sgn(A)^/2(A5 - mK(X)). Ap- 
propriate summary of derivation of |4| is provided in Supplementary 
Materials. 

|DanielsH1954) has shown that eq. J5J has a unique simple root under 
most conditions and that A increases with S, with A = for S = 
m {W) where (W) = J w xp(x) dx is the mean of W. While the 
approximation |4j is uniformly valid over the whole domain of p, its 
components need to be rearranged for numerical computation near the 
mean. When S»m {W), <j>(z)/y dominates and the overall error is 
0(m, _1 ) jDaniels|[l987) . 

SaddleSum, our implementation of Lugananni-Rice approximation 
for computing enrichment P-values, first solves eq. |5J for A using New- 
ton's method and then returns the P-value using J3J. The derivatives of 
the cumulant generating function are estimated from w: we approxi- 
mate the moment generating function by p(t) — 



e tw i , and 



then ,?£"'(*) = p'(t)/p(t) and K"(t) = p"(t)/p(t) - (K'(t)) 2 . Since 
the same w is used to sequentially evaluate P-values of all terms in V, 
we retain previously computed A values in a sorted array. This allows 
us, using biliary search, to reject many terms with P-values greater than a 
given threshold without running Newton's method or to bracket the root 
of |5} for faster convergence. More details on the SaddleSum implemen- 
tation and evaluations of its accuracy against some well-characterized 
distributions are in Section 2 of Supplementary Materials. When run as 
a term enrichment tool, SaddleSum reports E- value for each significant 
term by applying Bonferroni correction to the term's P-value. 
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2.2 Gene Ontology 

The assignment of human genes to GO terms was taken from 
the NCBI gene2go file (ftp://ftp.ncbi.nih.gov/gene/DATA/gene2go.gz) 
downloaded on 07-02-2009. After assigning all genes to terms, we re- 
moved all redundant terms - if several terms mapped to the same set of 
genes, we kept only one such term. For our statistical experiments we 
kept only the terms with no less than five mapped genes within the set of 
weights considered and hence the number of processed terms varied for 
each realization of sampling (see below). 

2.3 Information flow in protein networks 

ITM Probe jStojmirovic and Yu| [2009} is an implementation of the 
framework for exploring information flow in interaction networks jSto-| 
jmirovic and Yu 2007 1. Information flow is modeled through discrete 
time random walks with damping - at each step the walker has a certain 
probability of leaving the network. Although ITM Probe offers three 
modes: emitting, absorbing and channel, we only used the simplest, 
emitting mode, to provide examples illustrating issues of significance 
assignment. The emitting mode takes as input one or more network pro- 
teins, called sources, and a damping factor a. For each protein node in 
the network, the model outputs the expected number of visits to that node 
by random walks originating from the sources, thus highlighting the net- 
work neighborhoods of the sources. The damping factor determines the 
average number of steps taken by a random walk before termination: 
a = 1 corresponds to no termination while a = leads to no visits 
apart from the originating node. For our protein-protein interaction net- 
work examples, we used the set of all human physical interactions from 
the BioGRID jBreitkreutz e7aT1|2008) , version 2.0.54 (July 2009). The 
network consists of 7702 proteins and 56400 unique interactions. Each 
interaction was represented by an undirected link. A link carries weight 
2 if its two ends connect to the same protein and 1 otherwise. 

2.4 Microarrays 

From the NCBI Gene Expression Omnibus (GEO) (Barrett et fl/l]|2"009") , 
we retrieved human microarray datasets with expression log 2 ratios 
(weights) provided, resulting in 34 datasets and 136 samples in total. 
For each sample, when multiple weights for the same gene were present, 
we took their mean instead. This resulted in a w where each gene is 
assigned a unique raw weight. For evaluations, we also used another 
version of w where negative weights were set to zero. This version fa- 
cilitated investigation of up-regulation while keeping the down-regulated 
genes as part of statistical background. 

2.5 Evaluating accuracy of P- values 

By definition, a P-value associated with a score is the probability of that 
score or better arising purely by chance. We tested the accuracy of re- 
ported P-values reported by enrichment methods via simulations on 'de- 
coy' databases, which contained only terms with random gene assign- 
ments. For each term from the decoy dataset and each set of weights 
based on network or microarray data, we recorded the reported P-value 



and thus built an empirical distribution of P-values. If a method re- 
ports accurate P-values, the proportion of runs, which we term empirical 
P-value, reporting P-values smaller than or equal to a P-value cutoff, 
should be very close to that cutoff. We show the results graphically by 
plotting on the log-log scale the empirical P-value as a function of the 
cutoff. 

For each given list of entities AT, be it from the target gene set of a 
microarray dataset or the set of participating human proteins in the in- 
teraction network, we produced two types of decoy databases. The first 
type was based on GO. We shuffled gene labels 1000 times. For each 
shuffle, we associated all terms from GO with the shuffled labels to re- 
tain the term dependency. This resulted in a database with approximately 
5 X 10 6 terms (1000 shuffles times about 5000 GO terms). In the sec- 
ond type, each term, having the same size m, was obtained by sampling 
without replacement m genes from Af ■ The databases from this type 
(one for each term size considered) contained exactly 10 7 terms. The 
evaluation query set of 100 w's from interaction networks was obtained 
by randomly sampling 100 proteins out of 7702 and running ITM Probe 
with each protein as a single source. The weights for source proteins 
were not considered since they were prescribed, not resulting from sim- 
ulation. Each run used a = 0.7, without excluding any nodes from the 
network. For microarrays, the set of 1 36 samples was used. Since both 
query sets are of size KilO 2 , the total number of w-term matches was 
^10 9 . 

2.6 Student's t-test (used by GAGE and T-profller) 

Similar to SaddleSum, t-test approaches are based on sum-of-weights 
score but use the Student's t-distribution to infer P-values. As before, let 
Wj denote the weight associated with entity j £ Af, let AA denote the 
set of m entities associated with a term from vocabulary and let AA 1 = 
Af\A4. For any setS C J\[ of size |e>|,let£,s = ygj Ejgs w j denote 
the mean weight of entities in S and let = rgh^i j w 3 ~ x s) 2 
be their sample variance. 

GAGE ( Luo et al. 2009 1 enrichment tool uses two sample t-test as- 
suming unequal variances and equal sample sizes to compare the means 
over AT and AA . The test statistic is 



yJs 2 M /m + s%./m 

and the P-value is obtained from the upper tail of the Student's t 
distribution with degrees of freedom 



v = (m — 1) 



s 4 + 



T-profiler (Boorsma et al. [ [2005 1 compares the means over AA and AA' 
using two sample t-test assuming equal variances but unequal sample 
sizes. The pooled variance estimate is given by 

2 (m - l)s 2 M + {n-m- l)s 2 M , 



and the test statistic is 



X M - X M' 

,',/- + — ' 
V m n — m 
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The T-profiler P-value is then obtained from the tail of the Student's t- 
distribution with v = n — 2 degrees of freedom. 

2.7 Hypergeometric distribution 

Methods based on hypergeometric distribution or equivalently, Fisher's 
exact test, use only rankings of weights and require selection of 'sig- 
nificant' entities prior to calculation of P-value. We first rank all entities 
according to their weights and consider the set C of c entities with largest 
weights. The number c can be fixed (say 50), correspond to a fixed per- 
centage of the total number of weights, depend on the values of weights, 
or be calculated by other means. The score S for the term M is given by 
the size of the intersection, C D M, between C and Ai. This is equiva- 
lent to setting S = ^ZjgjVt Wj with Wj = 1 for j £ C and otherwise. 
The P-value for score S is 

min(c,m) f m \ (n — m\ 

Prob(5>5)= J2 n ■ 

Hence, the P-value measures the likelihood of score 5 or better over all 
possible ways of selecting c entities out of Af, with m entities associated 
with the term investigated. 

In each of our P-value accuracy experiments we used two variants 
of the hypergeometric method, one taking a fixed percentage of nodes 
and the other taking into account the values of weights. For microar- 
ray datasets, the fist variant took 1% of available genes (HGEM-PN1) 
while the second select genes with four fold change or more (HGEM- 
F2). In experiments based on protein networks, we took 3% of available 
proteins (231 entities) for the first variant (HGEM-PN3) and used the 
participation ratio formula to determine c in the second (HGEM-PR). 
Participation ratio (Stojmirovic and Yu 2007) is given by the formula 



(E 



E 



We chose a smaller percentage of weights for microarray-based data ( 1 % 
vs 3% for data derived for networks) because the microarray datasets 
generally contained measurement for more genes than the number of 
proteins in the network. 

2.8 mHG score 

Instead of making a single, arbitrary choice of c and applying hyperge- 
ometric score, mHG method implemented in the GOrilla package ( Eden 
\et a/.| [2009 1 considers all possible c's. The mHG score is defined as 

min(c,m) fm\ (n — m\ 

mHG = mm ^ 7H\ > 

Kc<n * — * ( 

where k is the number of entities annotated by the term Ai among the c 
top-ranked entities. The exact P-value for mHG score is then calculated 
by using a dynamic programming algorithm developed by Eden et al. 
(Eden et oT1|2007) . For our experiments we used an implementation in 
C programming language that was derived from the Java implementation 
used by GOrilla. The implementation uses a truncated algorithm that 
gives an approximate P-value with improved running speed. 



2.9 Retrieval stability with respect to choice of c 

To evaluate consistency of investigated methods, we compared the 
sets of significant terms retrieved from GO using different numbers of 
nonzero weights as input. For each w, we sort in descending order the 
weights associated with entities. With each c selected, we kept c largest 
weights unchanged and set the remaining to to arrive at a modified set 
of weights w|C. We did not totally exclude the lower weights but kept 
them under consideration to provide statistical background. We submit- 
ted w|C for analysis and obtained from each statistical method a set of 
enriched terms ordered by their P-value. In Fig.[2]\ and Supplementary 
Fig. S3, we displayed the actual five most significant terms retrieved with 
their P-values for selected examples of weight sets. To investigate on a 
larger scale the retrieval stability to c changes, we computed for each 
method the overlap between sets of top ten terms from two different c's 
for the w sets mentioned in 'Evaluating accuracy of P-values' and then 
took the average (Figffh). 



3 Results 

We compared our SaddleSum approach against the fol- 
lowing existing methods: Fisher's exact test (HGEM 



(Boyle et al. 2004 1), two sample Student's t-test with 



equal (T-profiler ( Boorsma et al. 2005 1) and unequal 
(GAGE ( |Luo et ai\ |2009| l) variances, and mHG score 
( |Eden et al.\ [20071 [2009}. Based on data from both mi- 
croarrays and simulations of information flow in protein 
networks, the comparison shown here encompassed (in 
order of importance) evaluation of P-value accuracy, rank- 
ing stability and running time. Accurate P-value reflects 
the likelihood of a false identification and thus allows for 
comparison between terms retrieved even across experi- 
ments. Incorrect P-values therefore render ranking sta- 
bility and algorithmic speed pointless. Accurate P-values 
without ranking stability question the robustness of bio- 
logical interpretation. For pragmatic use of an enrichment 
method, even with accurate statistics and stability, it is still 
important to have reasonable speed. 

3.1 Accuracy of reported P-values 

The term P-value reported by an enrichment analysis 
method provides the likelihood for that term to be en- 
riched within w. To infer biological significance us- 
ing statistical analysis, it is essential to have accurate P- 
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P-value cutoff P-value cutoff 



Figure 1: Empirical P-values versus P-value cutoffs reported for investigated enrichment methods. Methods with accurate statistics have their 
curves follow the dotted line closely over the whole range. Each curve was constructed by aggregating the results of approximately 10 9 GO-based 
decoy term queries. Displayed on the left (right) are results using weights derived from protein network information flow simulations (microarrays). 
In microarray plots for SaddleSum, T-profiler and GAGE, full lines indicate the results where negative weights were set to 0, while dashed lines 
show the results using all weights. The reason that HGEM curves run below the theoretical line and parallel to it is that every curve is an aggregate 
of many curves, each of which (i) represents a single sample of weights determining parameters to be fed into hypergeometric distribution, and (ii) 
is a step function touching the theoretical line and dropping below it. Merging curves from many samples produces the effect seen in our plots. 



values. We analyzed the accuracy of P-values reported 
by the investigated approaches through simulating «10 9 
queries and comparing their reported and empirical P- 
values. 

Results based on querying databases with fixed term 
sizes are shown in Supplementary Figs. SI and S2. Shown 
in Fig.[T] are the results for querying GO-based gene- 
shuffled term databases, which retain the structure of the 
original GO as a mixture of terms of different sizes orga- 
nized as a directed acyclic graph where small terms are in- 
cluded in larger ones. The curves for all methods in Fig.[T] 
therefore resemble a mixture of curves from Supplemen- 
tary Figs. SI and S2 albeit weighted towards smaller-sized 
terms. 

For weights from both network simulations and mi- 
croarrays, SaddleSum as well as the methods based on 
Fisher's exact test (HGEM and mHG) report P-values that 
are acceptable (within one order of magnitude from the 
theoretical values). For HGEM and mHG, this is not sur- 



prising because our experiments involved shuffling entity 
labels and hence followed the null model of the hyperge- 
ometric distribution. On the other hand, the null model of 
SaddleSum and the t-test methods assumes weights drawn 
independently from some distribution (sampling with re- 
placement). For terms with few entities (m < 100), the 
difference between the two null models is minimal and the 
P-value accuracy assessment curves for SaddleSum run 
as close to the theoretical line as those for HGEM meth- 
ods. For m > 100, SaddleSum gives more conservative 
P-values for terms with large sums of weights (Supple- 
mentary Figs. SI and S2). In practice, this has no signifi- 
cant effect to biological inference. Large terms would be 
still selected as significant given a reasonable P-value cut- 
off and accurate P-values are assigned to small terms that 
are biologically specific. 

Two-sample t-test with unequal variances as used by 
GAGE package reports P-values so conservative that they 
are often larger than 0.01 and hence not always visible 
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GO term 

methyltransferase complex... (12) 

nucleoplasm part... (234) 

viral reproduction... (59) 

regulation of gene expression... (758) 

regulation of transcription... (643) 

nucleoplasm... (509) 

nuclear lumen... (642) 

intracellular organelle lumen... (723) 

organelle lumen... (753) 

membrane-enclosed lumen... (758) 

transcription regulator activity... (703) 

nucleus... (1738) 

transcription... (784) 

regulation of metabolic process... (1113) 

p53 binding... (9) 

negative regulation of cyclin-dependent protein k... (5) 

histone methyltransferase activity... (6) 

regulation of macromolecule metabolic process... (960) 

cellular biopolymer metabolic process... (2013) 

cellular macromolecule metabolic process... (2061) 

macromolecule metabolic process... (2175) 

biopolymer metabolic process... (2114) 

binding... (4688) 

protein binding... (4222) 

intracellular part... (3526) 



Tcell receptor complex... (12) 

immune system process... (390) 

Tcell activation... (64) 

receptor complex... (65) 

external side of plasma membrane... (31) 

lymphocyte activation... (87) 

immune response... (254) 

leukocyte activation... (112) 

Tcell differentiation... (28) 

nucleus... (1748) 

nuclear part... (749) 

nuclear lumen... (598) 

RNA processing... (231) 

nucleoplasm... (458) 

nucleoplasm part... (215) 

RNA splicing... (159) 

receptor signaling complex scaffold activity... (10) 

protein complex scaffold... (12) 

positive regulation of T cell activation... (30) 

regulation of T cell activation... (40) 

antigen receptor-mediated signaling pathway... (12) 

nucleobase, nucleoside, nucleotide and nucleic ac... (1307) 

cell activation... (127) 

S phase... (17) 

lymphocyte differentiation... (44) 
mRNA processing... (143) 

RNA splicing, via transesterification reactions... (123) 
RNA splicing, via transesterification reactions w... (122) 
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Figure 2: P-value consistency and retrieval stability. (A) The output of ITM Probe emitting mode with human MLL protein (histone methyltrans- 
ferase subunit) as the source (top) and the log 2 ratios from the human T cell signaling microarray GSM89756 (bottom) were processed by each 
of the five investigated statistical methods with varying number of weighted entities included for analysis (All and Pos include all entities; All 
uses raw weights while Pos sets all negative weights to 0). The P-values for GO terms from the union of the sets of top-five hits for each method 
and different numbers of selected entities, are indicated by colors of the corresponding cell. Red dots show the actual top five hits for the method 
represented by that column. (B) Degree of overlap between sets of significant GO terms. Each panel corresponds to a single method with different 
numbers of entities used for analysis, with the results from microarray queries shown in the upper triangle and those based on network flow shown 
in the lower triangle. Color in each cell indicates the average pairwise overlap between the two sets of top-ten entities retrieved. For example, 
consider the light orange colored cell (horizontally labeled by 100 and vertically labeled by 500) in the mHG panel. This indicates that on average 
the top-ten terms retrieved by mHG using top 100 and top 500 network flow proteins share about three common terms. 



3 RESULTS 



7 



in our accuracy plots. This effect persists even for m as 
large as 500. This might be because the number of de- 
grees of freedom used is considerably small. In addition, 
its test statistic (eq. Q) emphasizes the estimated within- 
term variance s 2 M that is typically larger than the overall 
variance sj/. 

On the other hand, T-profiler generally exaggerates 



(Luo et al. 2009! significance because it uses the t- 



distribution with a large number of degrees of freedom 
(n — 2). Although some small terms may appear bio- 
logically relevant (as in Fig. [2}, one should not equate 
these exaggerated P-values with sensitivity. For microar- 
ray data, the log 2 ratios are almost symmetrically dis- 
tributed about (Supplementary Fig. S4). The distribu- 
tion of their sum is close to Gaussian. However, T-profiler 
still significantly exaggerates P-values for terms whose 
m < 25 (Supplementary Fig. S2). The statistical accu- 
racy of T-profiler worsens when negative log 2 ratios are 
set to 0. The reason for doing so is that allowing weights 
within each term to cancel each other may not be biolog- 
ically appropriate. GO terms may cover a very general 
category where annotations may not always be available 
for more specific subterms. Subsequently, terms may get 
refined and new terms may emerge. In such situation, it 
is desirable to discover terms that have genes that are sig- 
nificantly up-regulated even if many genes from the same 
term are down-regulated. 



3.2 Stability 

P-value accuracy, although the most important crite- 
rion, measures only performance with respect to non- 
significant hits, that is, the likelihood of a false positive. It 
is also necessary to consider the quality of enrichment re- 
sults in terms of the underlying biology. Testing the qual- 
ity directly, as described in the introduction, is not yet fea- 
sible. Instead we evaluated internal consistency of each 
method with respect to the number of top-ranked entities 
used for analysis. Fig.[2j\ shows the change of P-values 
reported for the top five GO terms with respect to the num- 



ber of selected entities using two examples with weights 
respectively from network flow simulation and microar- 
ray. Additional examples are shown in Supplementary 
Fig. S3. Results from evaluating the overall consistency 
of the best ten terms retrieved are shown in Fig.|2j3. 

Both HGEM and mHG methods are highly sensitive to 
the choice of c, the number of entities deemed significant. 
With a small c, their sets of significant terms resemble the 
top terms obtained by SaddleSum, while large values of 
c render very small P-values for large-sized terms (often 
biologically non-specific). This is mainly because HGEM 
and mHG treat all selected significant entities as equally 
important without weighting down less significant enti- 
ties, the collection of which may out vote the most signif- 
icant ones. Hence, although mHG considers all possible 
c values, to obtain biologically specific interpretation, it 
might be necessary to either remove very large terms from 
the vocabulary or to impose an upper bound on c. In that 
respect, mHG is very similar to the original GSEA method 



(Mootha et al. 2003 i, which also ignored weights. The 
authors of GSEA noted that the genes ranked in middle of 
the list had disproportionate effect to their results and pro- 
duced an improved version of GSEA ( Subramanian et al. 
|2005[ > with weights considered. 

GAGE does not show strong consistency because many 
P-values it reports are too conservative and fall above the 
0.01 threshold we used. Consequently, the best overlap 
between various cutoffs is about 5 (out of 10) for network 
flow examples and 4 for microarray examples (Fig.[2}3). T- 
profiler shows great internal consistency. Unfortunately, 
as shown in Fig.[T] Supplementary Figs. SI and S2, it re- 
ports inaccurate P-values, especially for small terms. This 
is illustrated in the top panel of Fig.|2|\, where T-profiler 
selects as highly significant the small terms (with 5,6 and 
9 entities), which are deemed insignificant by all other 
methods. The same pattern can be observed in Supple- 
mentary Fig. S3, although the severity is tamed for mi- 
croarrays. Using weights for scoring terms, SaddleSum is 
also stable with respect to the choice of c but with accurate 
statistics. 
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Total running time 


Average time per query 


Method 


network 


microarray 


network 


microarray 


SaddleSum 


558 


872 


0.56 


0.64 


HGEM 


501 


615 


0.50 


0.45 


T-profiler 


446 


586 


0.45 


0.43 


GAGE 


499 


651 


0.50 


0.48 


mHG 


2433 


3407 


2.43 


2.51 



Table 1: Running times of evaluated enrichment statistics algorithms 
(in seconds). We queried GO ten times with each of the five examined 
enrichment methods using weights from 100 network simulation results 
and 136 microarrays (same datasets used for P-value accuracy experi- 
ments). Running times for P-value calculations on dual-core 2.8 GHz 
AMD Opteron 254 processors (using a single core for each run) aggre- 
gated over all samples are shown on the left, while average times per 
query are shown on the right. The HGEM method used 100-object cut- 
off. 

3.3 Speed 

In terms of algorithmic running time (Table [TJ, paramet- 
ric methods relying on normal or Student's t-distribution 
require few computations. Methods based on hypergeo- 
metric distribution, if properly implemented, are also fast. 
On the other hand, non-parametric methods can take sig- 
nificant time if many shufflings are performed. Based on 
dynamic programming, mHG method can also take ex- 
cessive time for large terms. SaddleSum has running time 
that is only slightly longer than that of parametric meth- 
ods. 



4 Discussion 

Approximating the distribution of sum of weights by sad- 
dlepoint method, our SaddleSum is able to adapt itself 
equally well to distributions with widely different prop- 
erties. The reported P-values have accuracy comparable 
to that of the methods based on the hypergeometric distri- 
bution while requiring no prior selection of the number of 
significant entities. 



While our results show that GAGE method suffers from 
reduced sensitivity, it should be noted that it forms only a 
part of GAGE algorithm. GAGE was designed to com- 
pare two groups of microarrays (for example disease and 
control) by obtaining an overall P-value. In that scheme, 
the P-values we evaluated are used only for one-on-one 
comparisons between members of two groups. By com- 
bining one-on-one P-values (which are assumed indepen- 
dent), the overall P-value obtained by GAGE can become 
quite small. 

The assumed null distribution by T-profiler ( |Boorsma| 
\et al.\ |2005| l is close to Gaussian. It has been commented 
( |Luo et ai\ |2009| l that its statistics are similar to that of 
PAGE ( |Kim and Volskyl [2005] l, which uses Z-test. Natu- 
rally, the smallest, and likely exaggerated, P-values occur 
when evaluating small terms. For that reason, PAGE does 
not consider terms with less than 10 entities, which we 
included in our evaluation solely for the purpose of com- 
parison. 

Our network simulation experiments produce very dif- 
ferent weight profiles (Supplementary Fig. S4) than that 
of microarrays. These weights are always positive and 
skewedly distributed. Even after summing many such 
weights, the distribution of the sum is still far from Gaus- 
sian in the tail. Therefore, T-profiler and GAGE are un- 
able to give accurate statistics. Overall, our evaluations 
clearly illustrate the inadequacy, even for large terms, of 
assuming nearly Gaussian null distribution when the data 
is skewed. While Central Limit Theorem does guaran- 
tee convergence to Gaussian for large m, the convergence 
may not be sufficiently fast in the tail regions, which in- 
fluence the statistical accuracy the most. 

As presented here, SaddleSum uses given w both for 
estimating the m-dependent score distribution and for 
scoring each term. If a certain distribution of weights are 
prescribed, it is possible to adapt our algorithm to take 
a histogram for that distribution as input and use experi- 
mentally obtained weights for scoring only. 

A possible way to improve biological relevance in re- 
trieval is to allow for term-specific weight assignment. 
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For example, a gene associated with a GO term can be 
assigned a 'NOT' qualifier to indicate explicitly that this 
gene product is not associated with the term considered. 
A way to use this information would be to change the 
sign of the weight for such a gene (from positive to nega- 
tive or vice versa), but only when scoring the terms where 
the qualifier applies. Hence, potentially every term could 
be associated with a specific weight distribution. While 
all methods using weights can implement this scheme, 
SaddleSum is particularly suitable for it because it han- 
dles well the small terms and skewed distributions, where 
changing the sign for a single weight can have a consid- 
erable effect. This procedure can be generalized so that 
each gene in a term carries a different weight. 



Several authors (Huang et al. 2009 Goeman and 



Buhlmann] [2007) |Gold et al.\ |2007] l have raised the is- 
sue of correlation between weights of entities: gener- 
ally the weights of biologically related genes or proteins 
change together and therefore a null model assuming in- 
dependence between weights may result in exaggerated 
P-values. In principle, a good null model is one that can 
bring out the difference between signal and noise. To 
what level of sophistication a null model should be usu- 
ally is a trade-off between statistical accuracy and retrieval 
sensitivity. Using protein sequence comparison for exam- 
ple, ungapped alignment enjoys a theoretically character- 
izable statistics (Kar lin and Altschull [T99 0) but is not as 



sensitive as the gapped alignment ( |Altschul et al.\ |1997[ ), 
where the score statistics is known only empirically be- 
cause the null model allows for insertions and deletions 
of amino acids. Incorporating insertion and deletion into 
the null model made all the difference in retrieval sensitiv- 
ity. This is probably because insertions/deletions do occur 
abundantly in natural evolution of protein sequences. The 
ignorance of protein sequence correlations, assumed by 
both ungapped and gapped alignments, does not seem to 
cause much harm in retrieval efficacy. 

Although SaddleSum assumes weight independence 
and thus bears the possibility of exaggerating statistical 
significance of an identified term, it mitigates this issue 



by incorporating the entire w in the null distribution. It 
includes the entities with extreme weights that clearly rep- 
resent 'signal' and not 'noise', bringing higher the tail of 
the score distribution and thus larger P-values. Indeed, as 
shown by examples in Fig.|2j\ and Supplementary Fig. S3, 
SaddleSum does not show unreasonably small P-values. It 
should also be noted that SaddleSum is designed for the 
simple case where a summary value is available for each 
entity considered - its use for analyzing complex microar- 
ray experiments with many subjects divided into several 
groups is beyond the scope of this paper and care must be 
exercised when using it in this context. 

SaddleSum is a versatile enrichment analysis method. 
Researchers are free to process appropriately their exper- 
imental data, produce a suitable w as input, and receive 
accurate term statistics from SaddleSum. Since it does 
not make many assumptions about the distribution of data, 
we foresee a number of additional applications not lim- 
ited to genomics or proteomics, for example to literature 
searches. 
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A Saddlepoint approximation of tail 
probabilities 

References about saddlepoint approximations of the tail 



probabilities of random variables are abundant Lugan- 
|nani and Rice1 ( [T^ ; [DameIsl ( fT987l ); |Wood et q/.| | [T993) ; 



Jensen ( 1995 1. For completeness of our exposition we 
here present the derivation of the Lugannani-Rice formula 



Lugannani and Rice ( 1980| l, relying extensively on expo- 
sitions by Daniels |Daniels| ( fl987| ) and Woods, Booth and 
Butler |Wood ITaE] ( fT993) . 

Let X be a continuous random variable supported on a 
subset of R. We will assume that its probability density 
function (PDF), denoted by fx exists and that its mo- 
ment generating function (MGF), defined by px(t) — 
fx{x)e tx dx converges for real t g [a, b] where 
a < < b. Recall that px{it) gives the characteristic 
function of X, that is, the Fourier transform of fx and 
that fx can hence be recovered by the Fourier inversion 
formula: 



fx(x) 



1 

27 
1 

27 

1 

2?ri 



e- ltx p x (it)dt 
e K x (it)-itx dt 

e K x {t)-tx ^ 



(6) 
(7) 
(8) 



where Kx it) = In px (t) denotes the cumulant generat- 
ing function (CGF) of X. The tail probability or P-value 
for a value y (with respect to X), which we will denote by 
Qx (y) can be expressed as 



My) = Prob (* >y)= fx{x) dx (9) 

Jy 

/>oo pioo 

e K x {t)-tx dtdx (1Q) 

f y J —zoo 



1 

2ni 



Let S denote the sum of to independent, identically 
distributed random variables. We write S = Y^JLi Xj, 
where fx j — fx for all j. Our goal is to derive an asymp- 
totic approximation for the tail probability Qg. It can be 



easily shown that ps(t) = Px(t) and hence by ( 13 1 



Qs(s) = 



2ni 



c+ioo j. 
e mK x (t)-U °± 



(13) 



To produce our approximation we note that the main con- 



tributions to the integral ( 1 3 i occur in the neighborhood 
of the pole at t = and in the neighborhood of the saddle 
point t = X where the exponent I(t) = mKx (t) — ts has 
a maximum, that is, where I' it) = 0. The saddlepoint 
condition is thus 



mK'x(X), 



or alternatively 



(x--)f x (x)e Xx dx 

\ TO/ 



(14) 



(15) 



Let E(X) denote the expectation of X. Daniels Daniels 



1954) > has shown that eq. ( fl4) > has a unique simple root 



under most conditions. The value of A increases with s, 
with sgn(A) = sgn(s — mE(l)). 

When s toE(X), the contribution of the pole at 



t = to ( 13 1 is very small and to obtain an asymptotic 



approximation to Q$ one can proceed by expanding I(t) 
as a Taylor's series about t = A and integrating the re- 
sulting integral term-by-term [Daniels ( |1987| l. However, 
as s gets closer to the mean E(S) — mE(X), such ap- 
proximation performs poorly and in fact is unbounded at 
the mean. The essence of the method of Bleistein ( 1966) 
as applied to by Lugannani and Rice Lugannani and 



Rice ( 1980) is to produce a transformation of the integral 
— : / / e Kx(t ^~ tx dx dt (11) ( 13 1 that would take into account the pole and hence to 



c+ioo 



,K x (t)-ty 



dt 



2iri J c—ioQ t 



(12) 



where c € (0, b) is a constant introduced to avoid the pole 
at t = 0. 



produce an approximation uniformly valid over the whole 
range of S. 

Make a transformation from t to a new variable z by 



Kn(z) — zz = mKx(t) — ts, 



(16) 
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where N denotes the Gaussian random variable with PDF 

In{x) = <t>{x) = exp(— x 2 /2)/^/2tt and Qn(x) — 



= J 4>{t) dt and s satisfies ( 14 1. The value z is 
chosen so that the minimum of the left side is equal to 
the minimum of the right side, which occurs when t = X. 
Since K^{z) 



1 2 

kz , eq. 



16 1 becomes 



—z 
2 



zz = mK x (t) - tmK' x (X). 



To find z, we set z — z and t = A in ( 17 1 to get 

- l -z 2 = m{K x {X)-XK> x {X)) 



(17) 



(18) 



or, taking the sign for z to be equal to the sign of A, 

z = sgn(A) J {2m{XK' x {X) - K X {X)) (19) 



sgn(X)d(2(Xs-mK x (X)). 



(20) 



The transformation ( 17 i maps the region [0, A] in t- 



space into the region [0, z] in z-space. The local behavior 
of mK x (t) — tmK' x (A), which vanishes at t = and has 
zero derivative at t = X is reproduced by ^z 2 — zz with 
similar behavior at z = and z — z. Let u = z — z. Then, 



1 



= mK x (t) - ts - mK x (X) + As. 



Expanding mK x (t) — ts about t = X we have 
1 -u 2 = 1 -rnK' x (X)v 2 + 1 - m K' x \Xy + ... 



1 



mK x {X)v 2 (l + a 3 v + a 4 v 2 + . . .) 



(21) 

(22) 
(23) 



where v = t — A and ov 



2^"' (A) 
n!if"(A)- 



Hence, 



m,K' x (X)v (l + + a^v 2 



,1/2 



(24) 



It follows that du/dv and dv/du are nonzero for all u £ 
[0, A] and u e [0, z], respectively. Since K x is analytic in 
the region of interest, u(v) and v(u) are analytic over the 
same intervals. Obviously, the same conclusion follows 
for z as a function of t and i as a function of z. By the 
inverse function theorem, the transformation tttz can be 



extended to a bijection between complex neighborhoods 
of [0, A] and [0,z]. 

The integral ( p"3} now transforms (using Cauchy's the- 
orem) into 



Qs(s) 



1 



K N {z)-zz 



' d—ioo 

where d > 0. For small t, we can write 



1 dt 

t dz 



dz, (25) 



z w z| t=0 + i 



t=o 



t=o 



(26) 



When A ^ and hence z / 0, differentiating 
obtain 

dz _ mK' x (t) - mK' x (X) 

dt z — z 



17 1 we 



(27) 



while when A = 0, (24i implies dz/dt = du/dv 
J mK x (0) when t is small. Thus, 



dz 
eft 



(28) 



and therefore, for small t, z 
Let 

U(z) 



l(s-mE(X)) if A ^0, 
y/mK x (0) if A = 

Ct where C is a constant. 



(29) 



1 dt 1 

t dz z 

By expanding mK x (t) — ts about t = 0, it can be shown 
that, lim z _j.o U(z) < oo and, since dt/dz is analytic, 
U(z) is analytic in the neighborhood of z = that in- 



cludes z. Therefore, we can rewrite the integral (25 i as 



i 

1 

2tt? ./,/ 



ci+zoo 



s dz 



d—ioo 
d+ioo 



e K N (z)~zz ™ (3Q) 
Z 

M*)-**U(z)dz. (31) 



The singularity has now been isolated into (30i, which 



by comparing with ( 13 I, we recognize to equal $(5). On 
the other hand, U (z) can be expanded as a Taylor's se- 
ries around the saddlepoint z — z and integrated to obtain 



an asymptotic series for (31 1. For the first-order approx- 
imation, that is, the leading behavior, we only take the 
constant term at z. Let 



y 



, dz 
dt 



t- 



X- 



du 
dv 



XdmK' x (X) 



(32) 



B SADDLESUM IMPLEMENTATION 



14 



Then, U(z) = 1/y — 1/z and the integral (31 1 becomes 



U(z) 



1 



z+ioo 



3 K N (z)-zz dz = 



1 1 

y z 



(z). 



(33) 



Thus, we have obtained the Lugananni-Rice formula: 

Prob(5 > s) = $(z) + Q - i J </>(£), (34) 



with z(s) given by ( 14 1 and (20 1. 



B SaddleSum implementation 

As mentioned in the main text, our SaddleSum algorithm 
approximates term P-values by first solving eq. ( fT~4| > for 
A using Newton's method and then using the Lugananni- 



Rice formula (34 1. The key step is estimation of A. Since 
the moment-generating function p of the underlying space 
W is not known, we estimate it (and its derivatives) using 
w. Given sufficiently many weights (n » 1), the results 
can be quite accurate (see below). One limitation of this 
approach is that our approximation can only accept scores 
not greater than m times maximal weight (A becomes in- 
finite at this bound). Thus, the approximation can be in- 
accurate for very large scores, causing a larger than usual 
relative error in P-values (Fig. S6). However, occurence 
of such extreme scores is rarely seen in practice. 

Theoretically, Lugananni-Rice formula is valid over the 
whole range of the distribution, for small and large scores 



and both near the mean and in the tails Lugannani and 
Rice[ ( |1980| l. However, the form ( f34] > becomes numeri- 
cally unstable close to the mean of the distribution (i.e. 
when A is close to 0). Alternative asymptotic approxi- 
mations exist that are numerically stable near the mean 
Daniels ( 1987| l. For SaddleSum, we were mainly inter- 



ested in the tail probabilities and we therefore decided 
not to attempt to approximate the P-values of the scores 
smaller than one standard deviation from the mean (Sad- 
dleSum returns P-value of 1 for all such scores). Terms 
with such scores are never significant in the context of en- 
richment analysis. 



When processing a terms database, we retain previ- 
ously computed values of A with associated scores and 
parameters for Lugananni-Rice formula in a sorted array. 
Since A and the P-value are monotonic with respect to 
the score, using binary search we can certify for many 
terms that their P-value is larger than a given cutoff and 
hence eliminate them without running Newton's method. 
Furthermore, binary search provides a bracket for A and 
hence Newton's method usually converges in very few it- 
eration. We use the bracketed version of Newton's method 
recommended in the Numerical Recipes book |Press et al.\ 
(2007) (Section 9.4). This combines the classical New- 
ton's method with bisection and has guaranteed global 
convergence. 

We show evaluations of SaddleSum performance 
against some theoretically well-characterized distribu- 
tions in Fig. S5 and S6. It can be seen that the rela- 
tive error between the SaddleSum approximation and the 
theoretical P-value is generally very small except for ex- 
tremely large scores, when P-values are very small. In 
the context of the enrichment analysis, this discrepancy 
is not important because such terms will be evaluated as 
highly significant even if the P-value is off by few or- 
ders of magnitude. To further illustrate the quality of our 
approximation, we have computed the Kullback-Leibler 
(KL) divergences (relative entropies) between the tail dis- 
tribution implied by SaddleSum and the theoretical dis- 
tribution. Prior to computation of KL divergence, both 
distributions were normalized over the region where Sad- 
dleSum is valid (i.e. the tail with scores larger than one 
standard deviation over the mean). All KL divergence 
values are extremely small and are comparable between 
distributions. 

Fig. S7 shows relative errors of SaddleSum compared 
to the empirical distributions using the same weights and 
term sizes as for Fig. SI and S2. In this case however, in 
agreement with the null model of SaddleSum, we sampled 
weights with replacement. Our results indicate that, ex- 
cept for small m with weights coming from network flow 
simulations, the relative error of the SaddleSum is simi- 
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lar to that obtained in comparison with well-characterized 
distributions. 
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Figure SI: Accuracy of reported P-values from simulations using weights from 100 results of protein network information flow 
simulations. Each graph shows empirical P-values associated with reported P-value cutoffs for investigated enrichment methods, 
obtained from queries of decoy term datasets with fixed size terms. The curves for GAGE are omitted from the plots for term sizes 
5, 15 and 25 because all reported P-values were greater than 10 -2 . The graph for m = 500 misses the results for mHG because we 
could not finish the simulation runs within any reasonable amount of time. 
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Figure S2: Accuracy of reported P-values from simulations using weights from 136 microarrays. Each graph shows empirical P- 
values associated with reported P-value cutoffs for investigated enrichment methods, obtained from queries of decoy term datasets 
with fixed size terms. For SaddleSum, T-profiler and GAGE, full lines indicate the results where negative weights were set to 0, 
while dashed lines show the results using all weights. 
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Figure S3: Additional examples of sets of top-five GO terms retrieved by evaluated methods (refer to Fig. 2B for full explanation.) 
The upper two panels show the enrichment results using the weights from outputs of 1TM Probe emitting mode with human proteins 
APOAl (apolipoprotein A-I, a major protein component of high density lipoprotein in plasma) and PPP2R2A (phosphatase 2 
regulatory subunit B) as sources. The lower two panels show the results using weights from microarrays investigating mast cell 
activation (GSM73587) and malaria response (GSM63320). 
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Figure S4: Distributions of weights for examples from Fig. 2 and Fig. S3. Network examples are shown on the left, microarray on 
the right. 
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Figure S5: P-values (left) and relative errors (right) for SaddleSum approximations of sums of i.i.d. continuous random variables 
that are characterized theoretically. In each case 10000 weights were randomly sampled from a distribution and used as input to 
SaddleSum. The P-values from SaddleSum were compared with P-values from theoretical distributions of the sum of m numbers. 
Kullback-Leibler divergences (DKL) between the approximated tails of distributions are shown in parentheses for each m. Top: 
Gaussian (standard normal) weights - sum follows normal distribution. Middle: squared Gaussian weights - sum follows Chi- 
squared distribution. Bottom: weights from exponential distribution - sum follows Erlang distribution. 
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Figure S6: P-values (left) and relative errors (right) for SaddleSum approximations of sums of i.i.d. Bernoulli ({0, 1}) random 
variables with different parameter p. Such sums follow binomial distribution. In each case 10000 weights were randomly sampled 
from a distribution and used as input to SaddleSum. The P-values from SaddleSum were compared with P-values from the binomial 
distribution. Kullback-Leibler divergences between the approximated tails of distributions are shown in parenthesis for each m. 
Top: p = 0.3. Middle: p — 0.01. Bottom: p — 0.001. The dramatic increase in relative error is caused by A instability at extreme 
scores, see Section B. 
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Figure S7: Relative error of P-values reported by SaddleSum from simulations using weights from 100 results of protein network 
information flow simulations (left) and from 136 microarrays (right). These are the same query sets as evaluated in Fig. SI and 
Fig. S2 but in this case the weights are drawn with replacement. Each sample size m is shown in different color. Full lines indicate 
the results where negative weights were set to 0, while dashed lines show the results using all weights. 



